Test scores play a major role in the public’s perception of how well
schools educate students. Recent surveys of test use in schools have
indicated that a prime use of standardized achievement test data is for
accountability, that is, reporting to the public (e.g., see case studies in
Hathaway, 1983, and Herman & Dorr-Bremme, 1983). In the media, schools are
continually judged according to student test scores. The recent widely
publicized study of education in the United States by the prestigious National
Commission on Excellence in Education (1983) listed thirteen indicators of "a
nation at risk." Ten of those indicators involve test results. Yet, with
all this concern for how well schools are doing, little attention has been
paid to how well the tests are doing. Are they fair criteria by which to
judge education? What do they really tell us?


We make the preceding point to emphasize the importance of construct
validity in test use. Construct validity is more than an intellectual issue
for test developers; when tests are interpreted and used by the public, it
becomes a public issue. Cronbach (1980) has stated that test interpreters are
responsible for validating a test for a particular use, but that measurement
professionals must provide information that helps clarify what tests measure,
so that interpreters of tests might use them more wisely. Our purpose in this
paper is to further the cause of clarifying construct interpretations of
tests by proposing an alternative technique for analyzing underlying test
structure. Factor analysis is the method most commonly used for this purpose.
However, a variety of problems are associated with it. We propose that
nonmetric multidimensional scaling (MDS) may be more useful than factor
analysis or other latent structure models for investigating the internal
structure of tests. We then demonstrate the utility of this technique by
applying it to data from a widely used standardized test of reading
comprehension.


Background 


Comparison of MDS and Factor Analysis 


Multidimensional scaling is a data representation technique for showing
the relationships between objects by locating them as points in a continuous
space (Kruskal & Wish, 1978). In the spatial representation, similar objects
appear closer together and dissimilar ones appear farther apart. The
technique has been used for a variety of psychological purposes, and may be
applied to any set of objects for which measures of similarity (or
dissimilarity) are available. The method requires few assumptions--basically,
that the similarity measures can be represented as Euclidean distances.
Unlike factor analysis, it requires no special assumptions about the
underlying processes giving rise to the similarity data.


The most common procedure for examining the internal structure of single
tests, or the structure of batteries of tests, has been factor analysis.
Unlike multidimensional scaling, factor analysis sets forth and fits an
explicit model to a matrix of covariances or correlations. It makes the
strong assumption that each observed variable is a weighted sum of some small
number of common, unobservable variables called factors. In addition to the
common factors, there is a different unique factor for each observed variable.
These unique factors are assumed to be uncorrelated with the common factors
and with each other. As a consequence of these assumptions, the unique
factors contribute to their respective variables’ variances, but not to their
covariances. The covariances are entirely due to the common factors.
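
The model just described can be written compactly. The notation below is an
assumption introduced here for illustration, not taken from the sources cited:
x_j is observed variable j, f_k is common factor k, \lambda_{jk} is the weight
of factor k on variable j, and u_j is the unique factor with variance \psi_j.
The variance and covariance expressions further assume that the common factors
are uncorrelated and have unit variance.

```latex
x_j = \sum_{k=1}^{m} \lambda_{jk} f_k + u_j, \qquad j = 1, \dots, p

\operatorname{Var}(x_j) = \sum_{k=1}^{m} \lambda_{jk}^{2} + \psi_j,
\qquad
\operatorname{Cov}(x_j, x_l) = \sum_{k=1}^{m} \lambda_{jk} \lambda_{lk}
\quad (j \neq l)
```

Under these assumptions the unique variance \psi_j enters only the variance of
x_j, which is the sense in which the covariances are due entirely to the
common factors.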

Factor analysis requires the use of covariances or correlations as
measures of association between variables, and depends upon the special
mathematical properties of these measures. Problems of estimating
communalities, determining the correct number of factors, rotating to simple
structure, and naming the factors are not subjective but, in principle, admit
of only one correct solution. In contrast, nonmetric multidimensional scaling
allows the use of ordinal measures of association, and the user is free to
choose whatever measure best captures the interesting features of the data.
The choice of the number of dimensions in which to represent the data and the
orientation of the axes of the coordinate system are matters of informed
judgment. There is no commitment to a “true” number of dimensions or a “true”
coordinate system. In sum, the aim of factor analysis is to fit a specific
model to the data; the aim of multidimensional scaling is to represent the
data.

Special Problems With Dichotomous Item-Level Data

Special problems are associated with factor analyzing dichotomous data.
Specifically, dichotomous data cannot be modeled perfectly under the
assumptions of factor analysis. That is, when variables can take on only two
discrete values, they cannot be well described as the weighted sum of
continuous factors. While there have been a variety of ad hoc solutions to
this problem, the most defensible approach has been to apply the factor
analytic model, not directly to the observed dichotomous variables, but to
hypothesized unobservable continuous variables corresponding to each manifest
response (Christopherson, 1975; Muthén, 1978). Though mathematically elegant,
these models have been limited in their application to fairly small item sets
(up to about 20 items) due to technical problems of estimation. Moreover, the
models entail the additional assumptions that the factor scores and the
hypothesized continuous variables have multivariate normal distributions.

In addition to problems arising from the dichotomous nature of item
response data, individual test items are inherently less reliable than whole
tests. Small fluctuations in an examinee’s attention or performance during a
test can have a big impact on the response to a single item, but are unlikely
to significantly affect the total test score. This inherent unreliability is
of concern in any item-level analysis, whether MDS, factor analysis, or some
other technique. In order to obtain a stable solution, analyses of individual
items must employ larger samples of examinees than analyses of test scores.

Method


Data


These analyses use data from the Reading subtest of the Metropolitan
Achievement Tests (MAT), Elementary Battery, Form F (Durost, Bixler, Prescott,
Wrightstone, & Balow, 1970). This test, which is appropriate for assessing
fourth-graders, consists of eight short pieces of text, each followed by four
to eight 4-option multiple-choice questions. The questions are designed to
require comprehending the literal meaning of information in the text; drawing
inferences from the passage; identifying the best name or main idea of the
passage; or determining the meaning of a word in context (Prescott, 1973).
There are 45 items on the test.


Data were taken from the public-use tapes of the Norming Sample data from
the Anchor Test Study (Loret, Seder, Bianchini, & Vale, 1974). The tape
contains test information for a nationally representative sample of
approximately 63,000 fourth-grade children in over 400 schools. A systematic
sample of every thirtieth record on the tape (N=2089) was used for these
analyses. This sampling procedure assured proportionate representation of all
the schools in the original sample because the tape was sorted by school.
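
The sampling step can be illustrated with a short sketch. The function name,
the starting offset, and the toy school sizes are assumptions made for
illustration; the layout of the actual tape is not reproduced here.

```python
def systematic_sample(records, step=30, start=0):
    """Return every `step`-th record, beginning at index `start`."""
    if step < 1:
        raise ValueError("step must be a positive integer")
    return records[start::step]


if __name__ == "__main__":
    # Hypothetical file sorted by school: three schools of unequal size.
    tape = ["A"] * 60 + ["B"] * 90 + ["C"] * 30
    sample = systematic_sample(tape, step=30)
    # Because the file is sorted by school, each school contributes
    # roughly in proportion to its share of the records.
    print({school: sample.count(school) for school in "ABC"})
```

In this toy file the three schools hold 60, 90, and 30 records and contribute
2, 3, and 1 sampled records respectively, which is the proportionate
representation the sorting is meant to guarantee.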

Multidimensional scaling requires, as input, measures of association
between all the objects to be scaled. Therefore, we needed an appropriate
estimate of similarity between test items. The phi coefficient, which is the
product-moment correlation between pairs of dichotomous variables, is often
used as an index of association between test items. However, it suffers the
serious limitation that its maximum possible value depends critically upon the
item difficulties. The phi coefficient can reach a value of one only if the
two items being correlated are equally difficult. The upper limit on the phi
coefficient is 0.82 for two items with difficulties as similar as 0.5 and 0.6;
the upper limit drops to 0.65 if the difficulties are 0.5 and 0.7. Therefore,
the use of phi coefficients in MDS would be likely to result in an artifactual
dimension of difficulty.
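
The upper limits quoted above follow from the standard bound on phi given the
two item difficulties. A minimal sketch, assuming difficulties are expressed as
proportions correct (the function name phi_max is hypothetical):

```python
import math


def phi_max(p1: float, p2: float) -> float:
    """Largest attainable phi for two items with proportions correct p1, p2.

    The bound is reached when everyone who passes the harder item also
    passes the easier one.
    """
    if not (0.0 < p1 < 1.0 and 0.0 < p2 < 1.0):
        raise ValueError("difficulties must lie strictly between 0 and 1")
    lo, hi = min(p1, p2), max(p1, p2)
    return math.sqrt((lo * (1.0 - hi)) / (hi * (1.0 - lo)))


if __name__ == "__main__":
    for pair in [(0.5, 0.5), (0.5, 0.6), (0.5, 0.7)]:
        print(pair, round(phi_max(*pair), 2))
```

Rounded to two places, the three pairs give 1.0, 0.82, and 0.65, matching the
values in the text.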


Tetrachoric correlations were selected for use in this study because they
are less sensitive to differences in item difficulties than phi coefficients
and many other measures of association (Carroll, 1961). In general, the use
of tetrachoric correlations has been challenged on several grounds. Some
researchers have claimed that they are difficult to compute, that a sample
matrix of tetrachoric correlations may not be Gramian, that they have large
standard errors relative to product-moment correlations, and that they require
hypothesizing unobservable continuous variables corresponding to each manifest
binary variable. For our purposes none of these objections is sound. First,
several efficient computational algorithms have been developed, so that
calculating tetrachoric correlations is feasible with computing resources
routinely available today. Second, the fact that a matrix of separately
calculated sample tetrachoric correlations is sometimes non-Gramian is a
technical problem of estimation, not a substantive problem. Algorithms might
be constructed for constrained maximum likelihood estimation of positive
semi-definite tetrachoric correlation matrices following a procedure similar
to that of Bock and Petersen (1975). (The sample tetrachoric correlation
matrix we calculated for these data was Gramian.) Third, regarding the
standard error of tetrachoric correlations, the sample size used in this study
(N=2089) was sufficient to presume adequate precision. Finally, the assumption
of underlying continuous variables is irrelevant for nonmetric MDS. The
technique requires only that the rank ordering of the similarities between
variables be accurate. As long as this requirement is met, assumptions about
underlying distributions of skills, whether accurate or not, are
inconsequential.




We calculated the matrix of tetrachoric correlations using a
special-purpose FORTRAN program. The algorithm used was due to Saunders and
improved via Newtonian iteration, as described and implemented by Froemel
(1971). The program calculated correlations based only on actual responses.
That is, omitted responses were dropped from the analysis, rather than being
scored as incorrect. This technique avoids computing correlations that are
spuriously high because some examinees do not have time to complete the test.
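
The treatment of omitted responses can be sketched as follows. This is not the
FORTRAN algorithm cited above: for brevity it substitutes the classical
cosine-pi approximation to the tetrachoric correlation, and it assumes
responses are coded 1 (correct), 0 (incorrect), and None (omitted). Both
function names are hypothetical.

```python
import math


def fourfold_table(x, y):
    """Count (both right, x only, y only, both wrong) for two items.

    Pairs in which either response is None (omitted) are dropped rather
    than scored as incorrect.
    """
    a = b = c = d = 0
    for xi, yi in zip(x, y):
        if xi is None or yi is None:
            continue  # pairwise deletion of omitted responses
        if xi and yi:
            a += 1
        elif xi:
            b += 1
        elif yi:
            c += 1
        else:
            d += 1
    return a, b, c, d


def tetrachoric_cos_pi(a, b, c, d):
    """Cosine-pi approximation to the tetrachoric correlation."""
    ad, bc = a * d, b * c
    if ad == 0 and bc == 0:
        raise ValueError("table is too sparse to estimate a correlation")
    if bc == 0:
        return 1.0
    if ad == 0:
        return -1.0
    return math.cos(math.pi / (1.0 + math.sqrt(ad / bc)))


if __name__ == "__main__":
    item_x = [1, 1, 0, None, 0]
    item_y = [1, 0, 0, 1, None]
    print(fourfold_table(item_x, item_y))
    print(round(tetrachoric_cos_pi(40, 10, 10, 40), 3))
```

An iterative estimator such as the one the study used would replace
tetrachoric_cos_pi; the table-building step with pairwise deletion would stay
the same.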


Analyses


The matrix of tetrachoric correlations was scaled using the nonmetric
procedures of the KYST computer program (Kruskal, Young, & Seery, 1973). This
program estimates the best-fitting configuration of points from a given
starting configuration by an iterative procedure designed to reduce stress
(i.e., the mismatch between the rank orderings of the similarities and the
calculated distances in the configuration), and then rotates the axes to
principal components. The analysis was exploratory, and was conducted with no
a priori notions about the number of dimensions that might be needed to
represent the data. Therefore, we scaled the data in from one to six
dimensions and looked at both the level of stress and the configuration of
points in each solution to decide how best to represent the data. Initial
analyses used Kruskal’s stress formula 1 (SF1) and the TORSCA starting
procedure. Later analyses used Kruskal’s stress formula 2 and different
starting configurations to see how the outcomes would compare, and as a check
against the problem of local minima. The final representation of the data was
selected based on stress information, visualizability, and the
interpretability of the various solutions.
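
To make the stress criterion concrete, the sketch below evaluates Kruskal’s
stress formula 1 for a fixed configuration. It is an illustration only, not
the KYST implementation: the function names are hypothetical, tied
similarities are left in arbitrary order, and no iterative improvement of the
configuration is attempted.

```python
import math


def monotone_fit(values):
    """Pool-adjacent-violators: least-squares non-decreasing fit."""
    blocks = []  # each block holds [sum of values, number of values]
    for v in values:
        blocks.append([v, 1])
        # Merge backwards while a block mean exceeds the mean that follows.
        while (len(blocks) > 1 and
               blocks[-2][0] / blocks[-2][1] > blocks[-1][0] / blocks[-1][1]):
            total, count = blocks.pop()
            blocks[-1][0] += total
            blocks[-1][1] += count
    fitted = []
    for total, count in blocks:
        fitted.extend([total / count] * count)
    return fitted


def stress1(similarity, points):
    """Stress formula 1 for `points`, given a symmetric similarity matrix."""
    n = len(points)
    pairs = [(i, j) for i in range(n) for j in range(i + 1, n)]
    # Most similar pairs first, so distances should be non-decreasing.
    pairs.sort(key=lambda ij: -similarity[ij[0]][ij[1]])
    dist = [math.dist(points[i], points[j]) for i, j in pairs]
    disparities = monotone_fit(dist)
    numerator = sum((d - dh) ** 2 for d, dh in zip(dist, disparities))
    denominator = sum(d ** 2 for d in dist)
    return math.sqrt(numerator / denominator)


if __name__ == "__main__":
    points = [(0.0,), (1.0,), (3.0,)]
    consistent = [[1.0, 0.9, 0.1], [0.9, 1.0, 0.5], [0.1, 0.5, 1.0]]
    violated = [[1.0, 0.9, 0.5], [0.9, 1.0, 0.1], [0.5, 0.1, 1.0]]
    print(stress1(consistent, points), round(stress1(violated, points), 4))
```

Stress is zero when the distances reproduce the rank order of the similarities
exactly and grows as the order is violated, which is why only the ordinal
information in the correlations matters to the solution.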


We sought patterns in the final configuration based on the following item
characteristics: discrimination, difficulty, location of the item in the test,
passage dependence, and item type. Item discrimination was measured by the
point-biserial correlation between each item and the total test score. Item
difficulty was defined as the proportion of examinees that answered an item
correctly. Location refers to items classified as being at the beginning,
middle, or end of the test, depending on whether they are associated with the
first three, middle two, or final three passages of text. Passage dependence
was determined from results of a study by Tuinman (1973, personal
communication, November, 1981) in which he gave the items from this test, but
not the corresponding passages, to 1200 fourth-graders across the state of
Indiana. We used, as the measure of passage dependence (pd), the proportion
of Tuinman’s examinees that could answer an item correctly without seeing the
associated passage. Tuinman’s sample, though large and broadly
representative, is not strictly comparable to ours, so the value of pd may
differ somewhat in our sample. However, for our analyses, this difference is
unimportant as long as the rank order of pd values is similar in the two
samples. Item type refers to one of four types of items that the publishers
have identified in the test (“Content outlines,” 1971): literal comprehension,
inference, main idea, and vocabulary. Our judgment of item type (completed
before the data were scaled) corresponds to the test publisher’s judgment for
all but five items, which the publisher labeled as measuring literal
comprehension and which we think require some degree of inference. These five
items will be distinguished when the results are presented.
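
The two purely statistical item characteristics, difficulty and
discrimination, can be computed as in the following sketch. The function names
and the four-examinee data set are hypothetical; the point-biserial is
computed as the ordinary product-moment correlation between a 0/1 item score
and the total score, which here includes the item itself.

```python
import math


def item_difficulty(item):
    """Proportion of examinees answering the item correctly."""
    return sum(item) / len(item)


def point_biserial(item, total):
    """Product-moment correlation between a 0/1 item and the total score."""
    n = len(item)
    mean_x = sum(item) / n
    mean_y = sum(total) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(item, total))
    sxx = sum((x - mean_x) ** 2 for x in item)
    syy = sum((y - mean_y) ** 2 for y in total)
    return sxy / math.sqrt(sxx * syy)


if __name__ == "__main__":
    # Rows are examinees, columns are items (toy data).
    responses = [[1, 1, 1], [1, 1, 0], [1, 0, 0], [0, 0, 0]]
    totals = [sum(row) for row in responses]
    for j in range(3):
        item = [row[j] for row in responses]
        print(j, item_difficulty(item), round(point_biserial(item, totals), 3))
```

An item with no variance (everyone right or everyone wrong) has an undefined
point-biserial; a fuller implementation would need to handle that case.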


We hypothesized that several other item characteristics (e.g., distractor
similarity, syntactic complexity, and memory load) might be important to the
underlying test structure. However, the items did not vary systematically
along these dimensions, and so we were unable to classify them reliably and
unambiguously according to these features. Consequently, these
characteristics were omitted from the present analyses.


Results and Interpretation


This section comprises two parts. The first part reports some general
features of the data, describes how we selected a final MDS configuration, and
suggests that a two-dimensional representation can reveal much of the
underlying structure of data, even if more than two dimensions are apparent in
the data. The second part describes and interprets patterns in the data,
based on item characteristics.


Selection of a Representation

Different starting configurations can yield different MDS
representations. One strives to find the configuration of points that best
captures the underlying structure of the data. Our selection of a “best”
representation was based on a low level of stress, interpretability of the
configuration, and ease of visualization.


Stress. Figure 1 plots the minimum stress at each level of analysis by
the number of dimensions. As one can see, the stress level falls at first,
but then tapers smoothly as dimensions are added, giving little indication as
to how many dimensions are needed to represent these data. Clearly, one
dimension (SF1=0.313) is inadequate; at least two (SF1=0.203) and possibly
three (SF1=0.159) are necessary. In more than three dimensions, stress
decreases slowly, so it is unclear whether or not these dimensions might be
meaningful. In this case, information other than stress level is particularly
important for deciding how many dimensions are apparent in these data.


Analysis of item content revealed that the higher dimensions that are
identified by rotating axes to principal components seem to be pulling out
individual points or opposing sets of points that are unrelated to each other
in terms of any features we could identify. These points do not seem to
define dimensions in the data as a whole, but rather they seem to take
advantage of the space created by additional dimensions to move away from the
other points. This interpretation is supported by the fact that dimensions
are not consistently defined by the same points as the representation moves
into higher dimensional space. For example, in the three-dimensional
solution, Dimension 3 is represented by Items 2, 17, and 44 at one extreme,
and by Item 15 at the other extreme. When the solution goes into four
dimensions, Dimension 3 is defined by Item 2 at one end and Item 37 at the
other. Items 15 and 44 have moved into Dimension 2 and Item 17 has moved into
Dimension 4. As we include more dimensions, a pattern emerges: there are
seven to nine points (about 17% of the items on the test) that tend to go off
in their own directions, leaving the majority of points behind. We suspect,
for two reasons, that this outcome is not simply due to error in the data but,
rather, represents true idiosyncrasies in these items. First, the sample is
large and nationally representative. Thus, we expect that if this experiment
were repeated most of the same points would appear as outliers. Second, in
reading the items, prior to scaling, we hypothesized that several of these
outlying items measure something other than (or in addition to) reading
comprehension.

Figure 1. Plot of stress (Kruskal’s stress formula 1) by number of dimensions
in multidimensional scaling. (Data from Reading subtest of Metropolitan
Achievement Tests, Elementary Battery, Form F.)


As described above, the axes identified by rotating the configuration to
principal components seem to be defined by idiosyncratic outlying items.
Therefore, one would not necessarily expect them to be psychologically
meaningful. Indeed, apart from Dimension 1, the principal axes identified do
not seem to capture the psychologically important dimensions in these data.
This does not matter in two dimensions, because one may draw axes wherever
they seem appropriate. However, in three or more dimensions, the axes chosen
become crucial: if they are not the psychologically interesting ones, it is
very difficult to visualize the data so that the important dimensions can be
identified. Therefore, we chose to present the data in two dimensions,
expecting that the major trends will be apparent--though the dimensions may
not appear orthogonal to each other--and that this representation is less
likely than higher dimensional ones to hide important patterns. To no
surprise, the seven to nine points that tend to go off in their own dimensions
move toward the periphery of the two-dimensional representation.


Using various starting configurations, we found three constellations that
fit the data equally well in two dimensions (for each, SF1=0.203). As would
be expected, these solutions are quite similar, varying in the position of
only a few points. We had one dilemma in selecting which configuration to
present here. There is one item that moves a substantial distance between
solutions, appearing in two distinct regions. As one would hope, both of
these locations make sense in terms of the item’s characteristics.
Ultimately, we chose the configuration in which this item seems to be located
according to its type. However, we will describe both substantive
interpretations for this peripatetic item when we discuss the patterns in the
data, below.

Figures 2 and 3 depict the selected two-dimensional representation of the
data with items identified according to different characteristics. These
characteristics reveal several strong trends.


Item discrimination. One expects to see a clear relationship between
item discrimination and the MDS representation because both indicate how
similar each item is to all other items simultaneously. Item discrimination,
as summarized by the point-biserial correlation, estimates the degree of
association between each item and the composite of all items on the test. MDS
locates objects in space according to their similarity with each other object
in the space. Thus, though these summaries are computed in very different
ways, they tell similar stories. Specifically, the most highly discriminating
items (i.e., those that share most in common with the other items) should
cluster in the center of the MDS representation, and the least discriminating
(i.e., the most idiosyncratic) items should fall around the periphery. Items
of intermediate discrimination should fall along a gradient in between.
Figure 2 (the MDS configuration with points coded by degree of item
discrimination) does, indeed, exhibit this pattern, and contributes evidence
that this MDS representation accurately reflects the structure of the data.


Item difficulty. Figure 3a (with points coded according to the
proportion of examinees answering correctly) shows a distinct trend according
to item difficulty. The easiest items (p>0.80) cluster together, and
progressively harder items fall into nested egg-shaped rings around them.
Notice that the items are not evenly distributed inside the rings; rather,
they are more densely packed on the left. Indeed, if one imagines a vertical
line drawn at the right edge of the cluster of easiest items, the points to
the left of it fan out along the dimension of item difficulty, and only six
widely scattered points fall to the right of that vertical line. The obvious
question at this point is: what features distinguish the items on the left and
right, that is, why don’t the six dispersed points fall in with the rest?
Apparently, these items each measure something idiosyncratic. Indeed, the
three most widely scattered ones have the lowest point-biserial correlations
of all items on the test, and two more have quite low point-biserials (compare
Figure 2). Not surprisingly, these five points move into higher dimensions if
allowed to.


Figure 2. Multidimensional scaling of 45 items from the Reading subtest of the
Metropolitan Achievement Tests, with items identified according to level of
discrimination (point-biserial correlation of item with total test score).


Figure 3. Multidimensional scaling of 45 items from the Reading subtest of the
Metropolitan Achievement Tests, with items identified according to (a) difficulty
(proportion correct), (b) location in test, (c) passage dependence (proportion
correct without passage), and (d) item type. Arrow in (b) indicates the
alternative location of one item.

Location in the test. The data also show a strong pattern according to
the location of the items in the test (Figure 3b). Items from the beginning
of the test (those associated with the first three passages) generally fall
into the lower right portion of the configuration; items from the end (those
associated with the final three passages) fall in the upper left; and items
from the middle two passages generally fall along a diagonal line separating
the early and late items. Notice the arrow in Figure 3b that indicates the
two locations of the “peripatetic” item mentioned in the section on selecting
a representation. This item, which is from the middle of the test, can be
situated closer to the other middle items without altering the level of stress
in the total configuration.


It is not surprising that Figures 3a and 3b (item difficulty and
location) show a degree of correspondence, because the test is designed to
have easier questions at the beginning and harder questions at the end.
However, the correspondence is not perfect; some items near the beginning of
the test are more difficult, and these appear on the opposite side of the
configuration from the more difficult items at the end of the test.
Also, the difficult early items spread apart from each other, indicating that
they may be measuring idiosyncratic skills or knowledge. This interpretation
is, again, supported by the fact that these peripheral items move out into
their own space if more dimensions are added to the analysis. The difficult
items from the end of the test, with a few exceptions, are more densely
packed. This pattern suggests that most of these items measure relatively
similar knowledge or skills. However, three of the last items seem to measure
idiosyncratic skills--they sit around the periphery, and tend to move into
their own higher-dimensional space when it is available.


Passage dependence. Figure 3c shows the configuration with items
identified according to passage dependence, i.e., Tuinman’s estimate of the
proportion of examinees who can answer items correctly without reading the
passage (pd). The pattern here is striking. Most of the items that can be
answered easily without reading the passages--that is, items which over half
of Tuinman’s sample answered correctly without the passages available--(solid
circles in Figure 3c) cluster in the middle of the configuration. Indeed, the
central cluster consists almost exclusively of this type of item. Of the most
passage dependent items--items that 40% or fewer of Tuinman’s examinees could
answer correctly without reading the passage--(open circles in the figure)
only four fall in with this group, and their presence might be explained. Of
these four items, two are vocabulary items that may require the presence of
the passage, but do not require a thorough reading of it. It is possible to
answer these items simply by locating the bold-faced word in the passage and
comprehending its meaning within a single phrase or sentence. The other two
are literal comprehension items that can be answered by reading only the final
sentence in the passage. (In one case, this sentence immediately precedes the
question.)


It seems that the MDS representation centers around a large and
relatively dense cluster of items that might be answered with little or no
reference to the passage with which they are associated. Content analysis of
these items suggests that half of them might be answered using general
knowledge (e.g., that many diseases are caused by germs, not scientists,
penicillin, or medicine), and the other half probably require some aspect of
test-wiseness (e.g., gleaning information from other questions, or
second-guessing the test writer to choose the most plausible sounding option).


Item type. Figure 3d plots the MDS configuration with items identified
by type. (The items that the publisher labeled as measuring literal
comprehension and which we think require some degree of inference are
identified in the figure as literal/inference.) Inference items, which are
most numerous, scatter about the configuration. All of the items that we
judged as requiring literal comprehension fall within the center of the
configuration. The five items on which we disagreed with the test publisher
are scattered about the figure, one of them lying close to the center. The
test contains five vocabulary items that require the examinee to identify the
meaning of a word in the passage (represented in Figure 3d as circled dots).
Three of these items cluster tightly together in the center of the solution.
A fourth lies nearby (it is almost as close to the other vocabulary items as
it is to anything), and the last one sits off by itself in the upper right
corner. This final item is the most unusual item on the test: it has the
lowest point-biserial correlation and moves out furthest when new dimensions
of space are added. Finally, the test contains four items that ask for the
best name or the main idea of the passage (solid triangles in the figure).
Three of these items radiate out in a line to the left of the center. The
fourth sits on the opposite side of the large central cluster of points.


In sum, item type seems to explain less of what is going on in the data
than any of the other features identified, partly because there are so few
items designed to measure anything other than literal comprehension or
inference ability. The main idea and vocabulary items may measure two
specific skills: Three of the four main idea items seem to relate more
closely to each other than they do to other items. Three of the five
vocabulary items form a densely packed group, indicating that examinees who
answer one of them correctly also tend to pass the other two. (This result is
not necessarily expected, because vocabulary items measure understanding of
distinct words, and knowledge of one word need not imply knowledge of
another.) A fourth vocabulary item seems to be somewhat related to the other
three, but it is somewhat more difficult. The fifth appears to be quite
different from anything else on the test. It is less clear what the literal
and inference items measure. The items that we identified as requiring
literal comprehension all fall within a central group of points that seem to
require little or no reading of the passage (compare Figure 3c). Inference
items and the other “literal” items scatter about the MDS representation and
apparently measure individual skills or knowledge.


Discussion


Multidimensional scaling of these reading test data reveals some
surprising information about the underlying structure of the test. The test
centers around a stable set of items that apparently can be answered with
little or no reference to the associated passages. This result is surprising
and makes sense at the same time. It is surprising because one does not
expect a test of “reading skill” to contain so many items for which reading
the text is not required. The configuration makes sense because the items in
the center should have most in common with the entire set of items, and
therefore ought to require some general skills or knowledge. In this case,
the general abilities seem to be possession of some specific common knowledge,
reading and comprehension of multiple-choice test items, and perhaps a bit of
test-wiseness.


Most of the more difficult items scatter outside the central group. It
appears that slightly different specific abilities are required to solve each
of these items. This result is consistent with an original prediction made by
Guttman for the structure of mental ability test batteries. Guttman (1954)
described a hypothetical radex structure, with simpler tests in the center and
progressively more complex tests radiating out. It is significant that
Guttman (1965), and more recently Marshalek, Lohman, and Snow (1983), did not
find this result when they scaled batteries of mental tests. Rather, they
found that more complex (Guttman called them “rule-inferring”) tests fell in
the middle and simpler (or “rule-applying”) tests fell near the periphery.
The apparent contradiction between these results and ours may be a matter of
word usage. Our results are entirely consistent with those of Guttman and
Marshalek et al. if one thinks in terms of a continuum of generality, that is,
with tests (or items) that require general abilities in the middle and tests
that require specific abilities radiating toward the outside.

A facet of item difficulty that corresponds somewhat to item location in
the test also seems apparent in these data. The fact that the two facets
correspond imperfectly indicates that the determinants of item difficulty are
different for earlier and later items. Specifically, the early difficult
items seem to measure idiosyncratic skills or knowledge. Most of the later
ones seem to measure skills that probably involve high levels of vocabulary,
syntactic complexity, passage length, and abstractness of ideas. A few of the
later difficult items may measure idiosyncratic skills or knowledge.


Information about item type reveals least about the test, since 80% of the 
items are designed to measure either literal comprehension or ability to make 
inferences. Most of the literal comprehension items fall in the middle of the 
representation, and so seem to measure a general ability on this test. We do 
not know whether they actually measure literal comprehension ability or 
general knowledge and test-wiseness, because most of these items can be 
answered by many examinees without reading the passage. The inference items 
spread apart from each other, and do not seem to measure a unitary skill. 


Summary 


This paper makes three primary contributions to the study of construct 
validity. First, it proposes that nonmetric multidimensional scaling, with 
its limited assumptions, is a useful technique for representing the 
interrelationships between items in a test. Second, it suggests that typical 
problems associated with scaling dichotomous variables can be avoided by using 
tetrachoric correlations as input to the multidimensional scaling. Finally, 
it demonstrates the utility of the suggested procedures by applying them to 
actual data from the Reading subtest of the Metropolitan Achievement Tests. 
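
As an illustration of the second point, the following Python sketch shows one 
way dichotomous item scores might be turned into input for a nonmetric scaling 
program. It is not the procedure used in this study: it substitutes the simple 
cosine-pi approximation for a full tetrachoric estimate, and the function 
names and the choice of 1 - r as the dissimilarity are assumptions made for 
the example. Because nonmetric scaling uses only the rank order of the 
dissimilarities, any monotonically decreasing transformation of the 
correlations should serve equally well. 

```python
import math


def tetrachoric_approx(x, y):
    """Cosine-pi approximation to the tetrachoric correlation.

    x and y are equal-length sequences of 0/1 item scores.
    This is an approximation, not a maximum-likelihood estimate.
    """
    # Cells of the 2 x 2 table: a and d are concordant, b and c discordant.
    a = sum(1 for u, v in zip(x, y) if u == 1 and v == 1)
    b = sum(1 for u, v in zip(x, y) if u == 1 and v == 0)
    c = sum(1 for u, v in zip(x, y) if u == 0 and v == 1)
    d = sum(1 for u, v in zip(x, y) if u == 0 and v == 0)
    if b * c == 0:
        if a * d == 0:
            raise ValueError("2 x 2 table is degenerate")
        return 1.0  # no discordant pairs: limiting value of the formula
    return math.cos(math.pi / (1.0 + math.sqrt((a * d) / (b * c))))


def dissimilarity_matrix(items):
    """Return a symmetric matrix of 1 - r dissimilarities between items.

    items is a list of 0/1 score vectors, one vector per test item.
    """
    n = len(items)
    diss = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            r = tetrachoric_approx(items[i], items[j])
            diss[i][j] = diss[j][i] = 1.0 - r
    return diss
```

The resulting matrix could then be supplied as input to a nonmetric 
multidimensional scaling program. 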


References 


Bock, R. D., & Petersen, A. C. (1975). A multivariate correction for 
attenuation. Biometrika, 62, 673-678. 


Carroll, J. B. (1961). The nature of the data, or how to choose a correlation 
coefficient. Psychometrika, 26 (4), 347-372. 


Christoffersson, A. (1975). Factor analysis of dichotomized variables. 
Psychometrika, 40, 5-32. 


Content outlines (Metropolitan Achievement Tests Special Report, 1970 Edition, 
Report No. 2). (1971). New York: Harcourt, Brace, Jovanovich. 


Cronbach, L. J. (1980). Validity on parole: How can we go straight? In W. B. 
Schrader (Ed.), Measuring achievement: Progress over a decade (New 
Directions for Testing and Measurement, No. 5, pp. 99-108). San 
Francisco: Jossey-Bass. 


Durost, W. N., Bixler, H. H., Wrightstone, J. W., Prescott, G. A., & Balow, 
I. H. (1970). Metropolitan Achievement Tests, Form F (Elementary Level). 
New York: Harcourt, Brace, Jovanovich. 


Froemel, E. C. (1971). A comparison of computer routines for the calculation 
of the tetrachoric correlation coefficient. Psychometrika, 36, 165-174. 


Guttman, L. (1954). A new approach to factor analysis: The radex. In P. F. 
Lazarsfeld (Ed.), Mathematical thinking in the social sciences. Glencoe, 
IL: Free Press. 


Guttman, L. (1965). The structure of interrelations among intelligence tests. 
In Proceedings of the 1964 Invitational Conference on Testing Problems. 
Princeton, NJ: Educational Testing Service. 


Hathaway, W. E. (Ed.). (1983, September). Testing in the schools (New 
Directions for Testing and Measurement, No. 19). San Francisco: 
Jossey-Bass. 


Herman, J. L., & Dorr-Bremme, D. W. (1983, September). Uses of testing in the 
schools: A national profile. In W. E. Hathaway (Ed.), Testing in the 
schools (New Directions for Testing and Measurement, No. 19, pp. 7-17). 
San Francisco: Jossey-Bass. 


Kruskal, J. B., & Wish, M. (1978). Multidimensional scaling. Beverly Hills: 
Sage. 


Kruskal, J. B., Young, F. W., & Seery, J. B. (1973). How to use KYST, a very 
flexible program to do multidimensional scaling and unfolding. 
Unpublished paper, Bell Telephone Laboratories. 




Loret, P. G., Seder, A., Bianchini, J. C., & Vale, C. A. (1974). Anchor test 
study: Equivalence and norms tables for selected reading achievement tests 
(grades 4, 5, and 6). Washington, D. C.: U. S. Department of Health, 
Education and Welfare, Office of Education. 


Marshalek, B., Lohman, D. F., & Snow, R. E. (1983). The complexity continuum 
in the radex and hierarchical models of intelligence. Intelligence, 7, 
107-127. 


Muthén, B. (1978). Contributions to factor analysis of dichotomous variables. 
Psychometrika, 43, 551-560. 


National Commission on Excellence in Education. (1983). A nation at risk: The 
imperative for educational reform. Washington, D. C.: U. S. Government 
Printing Office. 


Prescott, G. A. (1973). [...] 
Tests. New York: Harcourt, Brace, Jovanovich. 


Tuinman, J. J. (1973). Determining the passage dependency of comprehension 
questions in five major tests. Reading Research Quarterly, 9, 206-223. 




